[staging] PR-5351 cross-OS validation (doc extraction + 3 fixes)#89
Open
danielhanchen wants to merge 13 commits into
Open
[staging] PR-5351 cross-OS validation (doc extraction + 3 fixes)#89danielhanchen wants to merge 13 commits into
danielhanchen wants to merge 13 commits into
Conversation
Expose and propagate the server-side document extraction concurrency limit so clients can cap their parallel extractions and avoid 503 busy responses. Backend: export _EXTRACT_CONCURRENCY, add max_extract_concurrency to DocumentSupportResponse, surface a default on import failure, and compute a larger chat JSON body limit (_OPENAI_CHAT_BODY_MAX_BYTES) that accounts for embedded image payloads. Also create the extraction task waiter once to properly drain exceptions. Tests updated to assert the new values and chat body behavior. Frontend: add setExtractionBackendLimit and apply the backend limit in the extraction queue and UI (cap slider, adjust stored settings), call setExtractionBackendLimit when caching document support, and tweak NDJSON progress wording from “uploaded” to “processed.”
for more information, see https://pre-commit.ci
Stub document-parser availability in chat document route tests to avoid spurious 501 errors in CI when optional parsers are missing, by monkeypatching _document_parser_support and _document_parser_unavailable_reasons. Update several AST-based tests to consider both the public handler (openai_chat_completions) and the implementation function (_openai_chat_completions_impl), preferring the impl when present, so structural checks still pass after streaming bodies were moved into the implementation. Changes touch test_chat_document_routes.py and multiple tests in test_stream_cancel_registration_timing.py.
…rll/unsloth into document-extractor-refactor
…rence coupling Three bugs in the document-extraction refactor (PR unslothai#5351) that produced real user-visible regressions on the unmodified PR head: 1. GGUF singleton split (P1). routes.inference defined an eager _llama_cpp_backend = LlamaCppBackend() module-level instance whose get_llama_cpp_backend() wrapper shadowed the canonical accessor in core.inference.llama_cpp. /api/inference/load wrote to the route instance; routes.models.list_models / cache-delete, run.py shutdown and core.chat.vlm_capability._probe_gguf all read the core instance. So a GGUF loaded the normal way was invisible to /api/models/list, deletable from cache while serving, leaked at shutdown, and the VLM probe never saw it. Removed the eager instance and wrapper; the existing top-of-file import from core.inference.llama_cpp now re-exports the canonical accessor. 2. Extractor semaphore leak (High). _run_extract_process_sync acquired _EXTRACT_SEMAPHORE before its try/finally block, so if multiprocessing.get_context, ctx.Queue or ctx.Process raised (OSError on fork-resource exhaustion, EAGAIN on Windows under load, Queue allocation failure on hardened sandboxes), the permit was leaked and the queue eventually deadlocked. Moved the try upwards to cover the multiprocessing setup; the finally now also terminates a partially-started worker. 3. HTML cleanup pulled in the inference stack (Medium). _extract_html did from core.inference._html_to_md import html_to_markdown. core.inference/__init__.py eagerly imported .orchestrator and .llama_cpp, so any failure deep in the inference dep chain made _extract_html fall back to raw HTML, splicing <script>/<style> tags into the outgoing prompt. Switched core.inference to PEP 562 __getattr__ lazy resolution so the stdlib-only HTML helper imports without dragging in the orchestrator or llama-server backend. Regression coverage: - tests/studio/test_gguf_singleton_shared.py asserts the routes.inference / core.inference.llama_cpp accessors return the same instance and that the VLM probe sees a route-loaded GGUF. - tests/studio/test_extractor_semaphore_leak.py monkeypatches each multiprocessing failure point and asserts _EXTRACT_SEMAPHORE._value is restored. - tests/studio/test_html_independent_of_inference.py runs in a fresh subprocess with poisoned core.inference.orchestrator/llama_cpp entries and asserts <script>/<style> are stripped. Full suite (PR head + fixes): 987 passed / 14 pre-existing failures unchanged (GPU/transformers/llama-server-binary dependent tests).
| await disconnect_task | ||
|
|
||
| return StreamingResponse( | ||
| _ndjson_stream(), |
Comment on lines
+28
to
+60
| runs-on: macos-14 | ||
| timeout-minutes: 25 | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.11' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install backend test dependencies (CPU only) | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -r studio/backend/requirements/studio.txt | ||
| pip install \ | ||
| python-multipart aiofiles sqlalchemy cryptography \ | ||
| pyyaml jinja2 mammoth unpdf requests \ | ||
| 'numpy<3' pytest pytest-asyncio httpx | ||
| pip install --index-url https://download.pytorch.org/whl/cpu 'torch>=2.4,<2.11' | ||
| pip install 'transformers>=4.51,<5.5' | ||
|
|
||
| - name: PR-5351 document tests (macOS spawn semantics) | ||
| working-directory: studio/backend | ||
| env: | ||
| # macOS's default start method is spawn; exercise the same | ||
| # config users see in production. | ||
| UNSLOTH_STUDIO_EXTRACT_CONCURRENCY: '2' | ||
| run: | | ||
| python -m pytest -q tests/test_chat_document_extraction.py tests/test_chat_document_routes.py tests/test_inference_worker.py tests/test_vision_cache.py tests/test_anthropic_messages.py tests/test_openai_tool_passthrough.py tests/test_models_get_model_config_case_resolution.py --tb=short | ||
|
|
||
| - name: PR-5351 regression tests + cancel timing | ||
| run: | | ||
| python -m pytest -q tests/studio/test_extractor_semaphore_leak.py tests/studio/test_html_independent_of_inference.py tests/studio/test_gguf_singleton_shared.py tests/studio/test_stream_cancel_registration_timing.py --tb=short |
Comment on lines
+29
to
+57
| runs-on: ubuntu-latest | ||
| timeout-minutes: 20 | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.11' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install backend test dependencies (CPU only) | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -r studio/backend/requirements/studio.txt | ||
| pip install \ | ||
| python-multipart aiofiles sqlalchemy cryptography \ | ||
| pyyaml jinja2 mammoth unpdf requests \ | ||
| 'numpy<3' pytest pytest-asyncio httpx | ||
| pip install --index-url https://download.pytorch.org/whl/cpu 'torch>=2.4,<2.11' | ||
| pip install 'transformers>=4.51,<5.5' | ||
|
|
||
| - name: PR-5351 document tests | ||
| working-directory: studio/backend | ||
| run: | | ||
| python -m pytest -q tests/test_chat_document_extraction.py tests/test_chat_document_routes.py tests/test_inference_worker.py tests/test_vision_cache.py tests/test_anthropic_messages.py tests/test_openai_tool_passthrough.py tests/test_models_get_model_config_case_resolution.py --tb=short | ||
|
|
||
| - name: PR-5351 regression tests + cancel timing | ||
| run: | | ||
| python -m pytest -q tests/studio/test_extractor_semaphore_leak.py tests/studio/test_html_independent_of_inference.py tests/studio/test_gguf_singleton_shared.py tests/studio/test_stream_cancel_registration_timing.py --tb=short |
Comment on lines
+29
to
+59
| runs-on: windows-latest | ||
| timeout-minutes: 30 | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.11' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install backend test dependencies (CPU only) | ||
| shell: pwsh | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -r studio/backend/requirements/studio.txt | ||
| pip install python-multipart aiofiles sqlalchemy cryptography pyyaml jinja2 mammoth unpdf requests "numpy<3" pytest pytest-asyncio httpx | ||
| pip install --index-url https://download.pytorch.org/whl/cpu "torch>=2.4,<2.11" | ||
| pip install "transformers>=4.51,<5.5" | ||
|
|
||
| - name: PR-5351 document tests (Windows spawn semantics) | ||
| working-directory: studio/backend | ||
| shell: pwsh | ||
| env: | ||
| UNSLOTH_STUDIO_EXTRACT_CONCURRENCY: '2' | ||
| run: | | ||
| python -m pytest -q tests/test_chat_document_extraction.py tests/test_chat_document_routes.py tests/test_inference_worker.py tests/test_vision_cache.py tests/test_anthropic_messages.py tests/test_openai_tool_passthrough.py tests/test_models_get_model_config_case_resolution.py --tb=short | ||
|
|
||
| - name: PR-5351 regression tests + cancel timing | ||
| shell: pwsh | ||
| run: | | ||
| python -m pytest -q tests/studio/test_extractor_semaphore_leak.py tests/studio/test_html_independent_of_inference.py tests/studio/test_gguf_singleton_shared.py tests/studio/test_stream_cancel_registration_timing.py --tb=short |
Real-world testing (Orimi test PDF, RFC 8259 PDF, "Attention Is All You
Need", calibre demo DOCX) plus an additional review pass surfaced five
follow-ups on top of the earlier singleton/semaphore/HTML fix:
1. Null-password PDFs were rejected as encrypted.
The classic Orimi PDF, Acrobat distilled scans, and a long tail of
PDFs in the wild carry a /Encrypt dict with an empty user password
so the file opens without prompting. pypdf.PdfReader.is_encrypted
and PyMuPDF's doc.is_encrypted both flag them, but the canonical
"needs a password" signal is PyMuPDF's needs_pass. The preflight
in routes.inference._preflight_pdf_page_count and the extractor in
core.chat.document_extractor._extract_pdf now refuse only when
needs_pass is True. pypdf's branch tries decrypt("") first and
falls through to PyMuPDF on failure.
2. Worker put-result-then-die race.
_run_extract_process_sync could observe proc.is_alive() == False
after the worker had already queued a successful result, exit the
loop with message=None, and surface a RuntimeError. Both the
in-loop is_alive() branch and the post-join branch now perform a
final result_queue.get_nowait() before declaring failure.
3. macOS multiprocessing start method.
The ternary picked "fork" on macOS, which is unsafe with Quartz /
PyObjC / PyMuPDF's CoreFoundation linkage. macOS now uses "spawn"
like Windows; Linux keeps "fork" for the CoW pickling win.
4. NDJSON streaming InvalidStateError on shield-cancel race.
The streaming NDJSON loop accepted extract_wait completion as a
signal to call extraction_task.result(). When asyncio.shield's
outer future was cancelled before the inner task finished, that
raised InvalidStateError and surfaced as a generic HTTP 500.
The branch now waits for extraction_task.done() and re-arms a
fresh shielded future when only the outer wrapper completes.
5. PaddleOCR-VL nondeterministic inference defaults.
Shipped temperature=1.5, min_p=0.1 -- causes hallucinated glyphs
and reorderings on a closed-form transcription task. Aligned with
the sibling OCR presets (DeepSeek-OCR, GLM-OCR) at temperature=0.0,
top_p=1.0, top_k=-1, min_p=0.0.
Regression test additions:
- tests/studio/test_pseudo_encrypted_pdf.py mints a null-password PDF
with PyMuPDF, asserts both the preflight and _extract_pdf accept it,
and confirms a real password-required PDF still raises
DocumentExtractionEncrypted.
Also drops importlib.reload from test_extractor_semaphore_leak.py: the
reload swapped _drain_future_exception out from under routes.inference,
breaking an existing identity assertion. The new fixture snapshots and
restores the semaphore counter instead.
Local: studio backend suite + 4 regression files: 91/91 PR-relevant
tests pass; the remaining 9 failures are the pre-existing
gpu_selection / kv_cache_estimation / help_output tests unchanged.
8bc38f0 to
3093b18
Compare
Mirrors the request-body hardening this PR already added to the sibling JSON inference endpoints (/v1/chat/completions at :1674, /v1/anthropic/messages at :2769, /v1/anthropic/messages_count at :2850). /api/inference/cancel still used await request.json() with no streaming cap, so an authenticated client could force the server to buffer arbitrarily large bodies and slip past the exact overflow hardening this PR added elsewhere. Switched to _read_json_body_limited(request, max_bytes=64 KiB). The real cancel payload is a small dict of identifiers (cancel_id, completion_id, session_id, message_id); 64 KiB is generous and matches the cap pattern used in the other authenticated route handlers. Stream-cancel registration timing (test_stream_cancel_registration_timing + test_cancel_atomicity + test_cancel_id_wiring) is unchanged.
3093b18 to
13992ca
Compare
Reconcile 11 conflicted files spanning backend and frontend with 248 commits of main since the PR diverged: backend - studio.txt: keep PR additions (pypdf, python-multipart, onnxruntime) alongside main additions (cryptography, httpx) - routes/inference.py: merge Pydantic model imports, keep cache_type_kv, combine trust_remote_code preflight with main's offline DNS guard, adopt main's _openai_messages_for_gguf_chat refactor, route through _normalize_openai_message_images, retain main's _drop_empty_assistant_sentinels and the is_vision kwarg - tests/test_anthropic_messages.py: keep PR's test_image_count_limit alongside main's TestAnthropicRequestedStudioTools and TestAnthropicMessagesToolRouting suites frontend - types/api.ts: unify OpenAIMessageContentPart with reasoning, image_generation_call, and image_url.detail; retain OpenAIChatContentPart alias for prior callers - types/runtime.ts: combine loadIn4Bit (PR) with fastMode (main) - assistant-ui/attachment.tsx: combine PR's document chip and unique attachmentDomId with main's filename-in-aria-label - assistant-ui/thread.tsx: keep PR's blocking-attachment send guard, thread the disabled state through main's ThreadComposerDock outside the viewport, drop the obsolete inline ComposerAnimated block, consolidate Composer to main's disabled/threadId API - chat-settings-sheet.tsx: combine PR's DocumentExtractionSection and preset state-machine reset wiring with main's external connections flow, removing duplicated preset useState now sourced from the store - chat/api/chat-adapter.ts: keep PR's appendContentParts/mergeAdjacentTextParts alongside main's reasoning normalizer, image-edit reference path, Anthropic refusal pruning, disabled-tool guard, and image gate - chat/hooks/use-chat-model-runtime.ts: keep both cache_type_kv and storedReasoningEnabled hydration - chat/runtime-provider.tsx: keep PR's document-extraction adapter, drop in-browser Text/Html/Docx adapters now handled by the backend, preserve main's listStoredChatMessages-based history and OpenDocumentAttachmentAdapter - chat/shared-composer.tsx: combine PR's document upload + retry pipeline with main's image-availability gate and generalized-compare guard - chat/stores/chat-runtime-store.ts: retain PR's DocExtractSettings, OcrPhase, and document support cache invalidation alongside main's preset-aware settings hydration, debounced saveSettingsPatch, beforeunload flush, scalar setting versions, last-external-checkpoint persistence, and maxTokens clamping
Drops studio-frontend-ci.yml, studio-inference-smoke.yml, studio-tauri-smoke.yml, wheel-smoke.yml, release-desktop.yml and stale.yml from this staging branch so the matrix stays below the 5-concurrent-Windows-runner cap. Keeps studio-backend-ci.yml as the Ubuntu sanity baseline. Adds three lanes that re-run the PR-5351 backend tests plus the three regression tests added in the fix commit: - pr5351-ubuntu.yml: ubuntu-latest, Python 3.11. CUDA spoof in tests/conftest.py engages on CPU runners. - pr5351-macos.yml: macos-14 (arm64). Exercises the multiprocessing spawn start-method and the MLX branch in core.chat.vlm_capability. - pr5351-windows.yml: windows-latest. Validates spawn + path normalisation + Process-construction-under-pressure (exactly the EAGAIN class the semaphore-leak fix protects against). Each workflow gates on paths: studio/backend/**, tests/studio/**, tests/conftest.py and its own file so unrelated commits do not re-trigger.
13992ca to
7b6d399
Compare
| ): void { | ||
| // Callers set must_change_password via setMustChangePassword(). Routing it | ||
| // through here would let CodeQL trace the boolean to localStorage and flag | ||
| // the (deliberate) JWT writes as sensitive-info storage. |
| let response = await postLogout(getAuthToken()); | ||
| if (response && response.status === 401 && getRefreshToken()) { | ||
| const refreshed = await refreshSession(); | ||
| if (refreshed) response = await postLogout(getAuthToken()); |
Adds a CPU end-to-end smoke that exercises: - the PR's `_extract_pdf` against an in-process synthetic PDF - llama-cpp-python (CPU build) loading Qwen2.5-0.5B-Instruct GGUF - inference on the extracted markdown with a ground-truth question Runs on ubuntu-latest, macos-14, and windows-latest with no GPU. Disables Metal on macOS and native autodetect on Windows/Linux so the lanes stay strictly CPU. Path-filtered to studio/backend/core/chat/, the test itself, and each workflow file so unrelated commits don't re-trigger.
Comment on lines
+26
to
+52
| runs-on: macos-14 | ||
| timeout-minutes: 40 | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.11' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install backend + llama-cpp-python (CPU build) | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -r studio/backend/requirements/studio.txt | ||
| pip install \ | ||
| python-multipart aiofiles sqlalchemy cryptography \ | ||
| pyyaml jinja2 mammoth pymupdf pymupdf4llm pytest pytest-asyncio \ | ||
| pytest-timeout huggingface_hub requests numpy | ||
| # Disable Metal so the lane stays CPU-only; mirrors a no-GPU Mac. | ||
| CMAKE_ARGS="-DGGML_METAL=OFF -DGGML_ACCELERATE=OFF -DGGML_NATIVE=OFF" \ | ||
| pip install --upgrade --quiet llama-cpp-python | ||
|
|
||
| - name: CPU inference on extracted document | ||
| env: | ||
| PR5351_LLAMA_THREADS: '3' | ||
| run: | | ||
| python -m pytest -q tests/studio/test_cpu_inference_on_extracted_document.py -s --tb=short |
Comment on lines
+28
to
+53
| runs-on: ubuntu-latest | ||
| timeout-minutes: 30 | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.11' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install backend + llama-cpp-python (CPU build) | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -r studio/backend/requirements/studio.txt | ||
| pip install \ | ||
| python-multipart aiofiles sqlalchemy cryptography \ | ||
| pyyaml jinja2 mammoth pymupdf pymupdf4llm pytest pytest-asyncio \ | ||
| pytest-timeout huggingface_hub requests numpy | ||
| # CPU wheel ships pre-built on Linux; falls back to source if needed. | ||
| CMAKE_ARGS="-DGGML_NATIVE=OFF" pip install --upgrade --quiet llama-cpp-python | ||
|
|
||
| - name: CPU inference on extracted document | ||
| env: | ||
| PR5351_LLAMA_THREADS: '4' | ||
| run: | | ||
| python -m pytest -q tests/studio/test_cpu_inference_on_extracted_document.py -s --tb=short |
Comment on lines
+25
to
+49
| runs-on: windows-latest | ||
| timeout-minutes: 40 | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.11' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install backend + llama-cpp-python (CPU build) | ||
| shell: pwsh | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install -r studio/backend/requirements/studio.txt | ||
| pip install python-multipart aiofiles sqlalchemy cryptography pyyaml jinja2 mammoth pymupdf pymupdf4llm pytest pytest-asyncio pytest-timeout huggingface_hub requests numpy | ||
| $env:CMAKE_ARGS = "-DGGML_NATIVE=OFF" | ||
| pip install --upgrade --quiet llama-cpp-python | ||
|
|
||
| - name: CPU inference on extracted document | ||
| shell: pwsh | ||
| env: | ||
| PR5351_LLAMA_THREADS: '4' | ||
| run: | | ||
| python -m pytest -q tests/studio/test_cpu_inference_on_extracted_document.py -s --tb=short |
Two corrections after the first run:
- Point at Qwen/Qwen2.5-0.5B-Instruct-GGUF (the canonical Qwen
team's repo); the unsloth/* fork at that name does not exist
and returned 401 on all three runners.
- Pass the required `max_figures`, `use_vlm_ocr`, and
`max_visual_payloads` kwargs to `_extract_pdf`.
Verified locally on the merge tip:
PYTHONPATH=studio/backend python -c '...' -> extracted 97 chars
including the expected 'RFC 8259' / 'JSON' tokens.
| "InferenceBackend", | ||
| "InferenceOrchestrator", | ||
| "get_inference_backend", | ||
| "get_llama_cpp_backend", |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Staging-only PR to validate unslothai#5351 (Studio document extraction with runtime VLM probe and bounded concurrency) on Ubuntu, macOS-14 and Windows runners.
Contents:
This branch is not intended to land upstream; the validated changes will be proposed against unslothai#5351 directly.